Probabilistic numeric convolutional neural networks

ABSTRACT

Certain aspects of the present disclosure provide techniques for performing operations with a probabilistic numeric convolutional neural network, including: defining a Gaussian Process based on a mean and a covariance of input data; applying a linear operator to the Gaussian Process to generate pre-activation data; applying a nonlinear operation to the pre-activation data to form activation data; and applying a pooling operation to the activation data to generate an inference.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/086,339, filed on Oct. 1, 2020, the entire content of which is hereby incorporated by reference.

INTRODUCTION

Aspects of the present disclosure relate to probabilistic numeric convolutional neural networks.

Machine learning is generally the process of producing a trained model (e.g., an artificial neural network, a tree, or other structures), which represents a generalized fit to a set of training data. Applying the trained model to new data produces inferences, which may be used to gain insights into the new data.

Machine learning models are seeing increased adoption across myriad domains, including for use in classification, detection, and recognition tasks. For example, machine learning models are being used to perform complex tasks on electronic devices based on sensor data provided by one or more sensors onboard such devices, such as automatically detecting features (e.g., faces) within images.

One particularly powerful type of machine learning model is the convolutional neural network (CNN) model, which is a type of deep neural network model that can be trained to identify various features in input data. CNNs typically rely on kernels or filters that are strided across a grid of input data, such as the grid formed by pixels in an image, through various layers of the CNN. Inherent in the conventional design of a CNN, then, is that the input data will be sampled regularly, such as in rectangular grids of input image data.

Unfortunately, not all input data is regularly sampled. For example, continuous input signals, like time series, that are irregularly sampled or which have missing values are challenging for existing deep learning model architectures, such as CNNs.

Accordingly, methods are needed to improve the performance of CNNs when processing continuous input data.

BRIEF SUMMARY

Certain aspects provide a method for performing operations with a probabilistic numeric convolutional neural network, including: defining a Gaussian Process based on a mean and a covariance of input data; applying a linear operator to the Gaussian Process to generate pre-activation data; applying a nonlinear operation to the pre-activation data to form activation data; and applying a pooling operation to the activation data to generate an inference.

Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.

The following description and the related drawings set forth in detail certain illustrative features of one or more aspects.

BRIEF DESCRIPTION OF THE DRAWINGS

The appended figures depict certain aspects of the one or more aspects and are therefore not to be considered limiting of the scope of this disclosure.

FIG. 1 depicts an example of a process for training a probabilistic numeric convolutional neural network.

FIG. 2 depicts an example method for training a probabilistic numeric neural network.

FIG. 3 depicts an example processing system configured to perform the various methods described herein.

To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one aspect may be beneficially incorporated in other aspects without further recitation.

DETAILED DESCRIPTION

Aspects of the present disclosure provide apparatuses, methods, processing systems, and computer-readable mediums for applying probabilistic numerics to convolutional neural networks (CNNs) to improve such models' ability to process continuous input data, including irregularly sampled input data.

Continuous input signals, like time series that are irregularly sampled or have missing values, are challenging for existing deep learning methods. One reason for this is that coherently defined feature representations generally depend on the values in unobserved regions of the input of irregularly sampled data. To overcome this issue, probabilistic numeric convolutional neural networks are described herein, which represent features as Gaussian processes, providing a probabilistic description of discretization error. Such probabilistic numeric convolutional neural networks define a convolutional layer as the evolution of a partial differential equation defined on a Gaussian process, followed by a nonlinear operation. Probabilistic numeric convolutional neural networks yield significant reductions in error from the previous state of the art on well-known datasets, such as SuperPixel-MNIST.

Standard convolutional neural networks are defined on a regular input grid. For continuous signals, these elements correspond to regular samples of an underlying function ƒ defined on a continuous domain. In such cases, the standard convolutional layer of a neural network is a numerical approximation of a continuous convolution operator $\mathcal{A}$.

Coherently defined networks on continuous functions should only depend on the input function ƒ, and not on spurious shortcut features, such as the sampling locations or sampling density, which enable overfitting and reduce robustness to changes in the sampling procedure. Each application of $\mathcal{A}$ in a standard neural network incurs some discretization error which is determined by the sampling resolution. In some sense, this error is unavoidable because the features $f^{(\ell)}$ at the layers $\ell$ depend on the values of the input function ƒ at regions that have not been observed. For input signals which are sampled at a low resolution, or even sampled irregularly (e.g., such as with the sporadic measurements of patient vitals data in ICUs or dispersed sensors for measuring ocean currents), this discretization error cannot be neglected. Simply filling in the missing data with zeros or imputing the values is not sufficient since many different imputations are possible, each of which can affect the outcomes of the network.

Probabilistic numerics is an emergent field that studies discretization errors in numerical algorithms using probability theory. As described herein, probabilistic numerics may be built upon to quantify the dependence of a model (e.g., neural network) on the regions in the input which are unknown, and integrate this uncertainty into the computation of the model. To do so, the discretely evaluated feature maps $\{f^{(\ell)}(x_i)\}_{i=1}^{N}$ are replaced with Gaussian processes: distributions over the continuous function $f^{(\ell)}$ that track the most likely values as well as the uncertainty. Beneficially, this Gaussian process feature representation need not resort to discretizing the convolution operator $\mathcal{A}$ as in a standard convolutional neural network, but instead the continuous convolution operator may be applied directly. If a given feature is a Gaussian process, then applying linear operators yields a new Gaussian process with transformed mean and covariance functions. The dependence of $\mathcal{A}f$ on regions of ƒ that are not known translates into the uncertainty represented in the transformed covariance function, the analogue of the discretization error in a convolutional neural network, which is now tracked explicitly. The resulting model, as described further herein, may be referred to as a probabilistic numeric convolutional neural network (PNCNN).

Probabilistic Numerics

Probabilistic numeric convolutional neural networks, as described herein, leverage probabilistic numerics, in which the error in numerical algorithms is modeled probabilistically, typically with a Gaussian process. In this framework, only a finite number of input function calls can be made, and therefore the numerical algorithm can be viewed as an autonomous agent which has epistemic uncertainty over the values of the input. One example is the Bayesian Monte Carlo model, in which a Gaussian process is used to model the error in the numerical estimation of an integral and to optimally select a rule for its computation. Probabilistic numerics has been applied to numerical problems such as the inversion of a matrix, the solution of an ordinary differential equation, a meshless solution to boundary value partial differential equations, and other numerical problems.

Gaussian Processes

Probabilistic numeric convolutional neural networks, as described herein, operate on a continuous function ƒ(x) underlying the input based on a collection of the values of that function sampled on a finite number of points $\{x_i\}_{i=1}^{N}$. Classical interpolation theory reconstructs ƒ deterministically by assuming a certain structure of the signal in the frequency domain. Gaussian processes (GPs) give a way of modeling beliefs about values that have not been observed. These beliefs are encoded into a prior covariance k of the GP, $f \sim \mathcal{GP}(0, k)$, and updated with Bayesian inference upon seeing data. Explicitly, given a set of sampling locations $x=\{x_i\}_{i=1}^{N}$ and noisy observations $y=\{y_i\}_{i=1}^{N}$ sampled as $y_i \sim \mathcal{N}(f(x_i), \sigma_i^2)$, using Bayes' rule, the posterior distribution $f \mid y, x \sim \mathcal{GP}(\mu_p, k_p)$ may be computed, which captures the epistemic uncertainty about the values between observations. The posterior mean $\mu_p(x)$ and covariance $k_p(x, x')$ are given by:

$\mu_p(x) = k(x)^T [K+S]^{-1} y, \quad k_p(x,x') = k(x,x') - k(x)^T [K+S]^{-1} k(x')$,  (1)

where $K_{ij} = k(x_i, x_j)$, $k(x)_i = k(x, x_i)$, and $S = \mathrm{diag}(\sigma_i^2)$.

In some aspects, a radial basis function (RBF) kernel (k_(RBF)) may be used to determine a prior covariance, due to its convenient analytical properties. For example:

${k_{RBF}\left( {x,x^{\prime}} \right)} = {{a\;{\mathcal{N}\left( {{x;x^{\prime}},{l^{2}I}} \right)}} = {{a\left( {2\pi\; l^{2}} \right)}^{- \frac{d}{2}}{{\exp\left( {{- \frac{1}{2l^{2}}}{\left\| {x - x^{\prime}} \right\|}^{2}} \right)}.}}}$
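As an illustration only, the posterior of Equation 1 under the RBF kernel above might be sketched in NumPy as follows. The function names (`rbf_kernel`, `gp_posterior`), the sample locations, and the noise levels are hypothetical choices made for this sketch and are not part of the disclosure.

```python
import numpy as np

def rbf_kernel(x, xp, a=1.0, l=1.0):
    """k_RBF(x, x') = a * N(x; x', l^2 I) for point sets of shape (n, d) and (m, d)."""
    d = x.shape[1]
    sq_dist = ((x[:, None, :] - xp[None, :, :]) ** 2).sum(-1)
    norm = (2.0 * np.pi * l**2) ** (-d / 2.0)
    return a * norm * np.exp(-0.5 * sq_dist / l**2)

def gp_posterior(x_obs, y, noise_var, x_query, a=1.0, l=1.0):
    """Posterior mean and covariance of Equation 1 evaluated at the query points."""
    K = rbf_kernel(x_obs, x_obs, a, l)        # K_ij = k(x_i, x_j)
    S = np.diag(noise_var)                    # S = diag(sigma_i^2), heteroscedastic
    k_q = rbf_kernel(x_obs, x_query, a, l)    # column j holds k(x_query_j)
    # Solve [K + S]^{-1} against y and against every column of k_q in one call.
    sol = np.linalg.solve(K + S, np.column_stack([y, k_q]))
    mean = k_q.T @ sol[:, 0]
    cov = rbf_kernel(x_query, x_query, a, l) - k_q.T @ sol[:, 1:]
    return mean, cov

# Hypothetical, irregularly spaced one-dimensional samples with very small noise.
x_obs = np.array([[0.0], [0.7], [2.5]])
y = np.array([1.0, 0.2, -0.5])
noise_var = np.array([1e-6, 1e-6, 1e-6])
mean, cov = gp_posterior(x_obs, y, noise_var, x_obs)
```

With nearly noiseless observations, the posterior mean evaluated at the observed locations reproduces the observations and the posterior variance there is close to zero, which is the expected interpolation behavior.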

In typical applications of GPs to machine learning tasks, such as regression, the function ƒ that is predicted is already the regression model. In contrast, here GPs are used as a way of representing beliefs and epistemic uncertainty about the values of both the input function and the intermediate feature maps of a model (e.g., a deep neural network model).

Probabilistic Numeric Convolutional Neural Networks

Given a continuous input signal $f: X \to \mathbb{R}^{c}$, a network with layers may be defined that acts directly on this continuous input signal. In one aspect, a neural network may be defined recursively from the input $f^{(0)} = f$, as a series of L continuous convolutions $\mathcal{A}^{(\ell)}$ with pointwise nonlinearities (e.g., ReLU) and weight matrices $M^{(\ell)} \in \mathbb{R}^{c \times c}$ that mix only channels (known as 1×1 convolutions) according to:

$f^{(\ell+1)} = M^{(\ell)}\,\mathrm{ReLU}[\mathcal{A}^{(\ell)} f^{(\ell)}]$,  (2)

A final global average pooling layer $\mathcal{P}$ may be added that acts channel-wise as a natural generalization of the discrete case: $\mathcal{P}(f^{(L)})_{\alpha} = \int f_{\alpha}^{(L)}(x)\,dx$ for each α=1, 2, . . . , c. Denoting the space of functions on X with c channels by $\mathcal{H}_c$, the convolution operators $\mathcal{A}^{(\ell)}$ are linear operators from $\mathcal{H}_c$ to $\mathcal{H}_c$. Like in ordinary convolutional neural networks, the layers build up increasingly more expressive spatial features and depend on the parameters in $\mathcal{A}^{(\ell)}$ and $M^{(\ell)}$. Unlike ordinary convolutional networks, these layers are well defined operations on the underlying continuous signal.

While it is clear that such a network can be defined abstractly, the exact values of the function ƒ^((L)) generally cannot be computed as the operators depend on unknown values of the input. However, by adopting a probabilistic description, it is possible to formulate ignorance of ƒ⁽⁰⁾ with a Gaussian process and see how the uncertainties propagate through the layers of the network, yielding a probabilistic output. The following briefly describes important components of Equation 2 that make this possible, with more detailed descriptions below.

Continuous convolution operators $\mathcal{A}^{(\ell)}$ in Equation 2 can be applied to an input Gaussian process $f \sim \mathcal{GP}(\mu_p, k_p)$ in closed form. The output is another Gaussian process with a transformed mean and covariance: $\mathcal{A}f \sim \mathcal{GP}(\mathcal{A}\mu_p, \mathcal{A}k_p\mathcal{A}')$, where $\mathcal{A}'$ acts to the left on the primed argument of $k_p(x, x')$. The description below explains how these continuous convolutions are parametrized in terms of the flow of a partial differential equation and how they can be applied to the radial basis function kernel exactly in closed form.

Applying a ReLU nonlinearity to a Gaussian process in Equation 2 yields a new non-Gaussian stochastic process $h^{(\ell)} = \mathrm{ReLU}[\mathcal{A}^{(\ell)} f^{(\ell)}]$, and the mean and covariance of this process have closed-form expressions that can be computed. This may generally be referred to as a probabilistic ReLU function.

The activations $h^{(\ell)}$ in Equation 2 are not Gaussian; however, for a large number of weakly dependent channels, it can be shown that $f^{(\ell+1)} = M^{(\ell)} h^{(\ell)}$ is approximately distributed as a Gaussian process, as described further below.

While $f^{(\ell+1)}$ in Equation 2 is approximately a Gaussian process, the mean and covariance functions have a complicated form. Instead of using these functions directly, it is possible to take measurements of the mean and variance of this process and feed them in as noisy observations to a fresh radial basis function kernel Gaussian process, allowing the process to be repeated and to build up multiple layers without increasing complexity.

In some aspects, the Gaussian process feature maps in the final layer ƒ^((L)) are aggregated spatially by an integral pooling $\mathcal{P}$ that can also be applied in closed form to yield a Gaussian output. Assembling these components allows implementation of an end-to-end trainable probabilistic numeric convolutional neural network, which integrates a probabilistic description of missing data and discretization error inherent to continuous signals.
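One way to see why integral pooling admits a closed form: when the RBF kernel is written as a scaled Gaussian density, each kernel basis function integrates to the amplitude a over the whole real line, so the integral of the posterior mean reduces to a times the sum of the solved weights. The sketch below checks this in one dimension. It assumes pooling over the entire real line and covers only the pooled mean; all names and values are hypothetical.

```python
import numpy as np

def gauss(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

a, l = 2.0, 0.5                                  # hypothetical kernel amplitude and length scale
x_obs = np.array([-1.0, 0.3, 1.1])               # hypothetical sample locations
y = np.array([0.4, -0.2, 0.9])                   # hypothetical observations
noise_var = np.array([0.01, 0.05, 0.02])         # heteroscedastic noise

K = a * gauss(x_obs[:, None], x_obs[None, :], l**2)
w = np.linalg.solve(K + np.diag(noise_var), y)   # [K + S]^{-1} y

# Closed form: integral of mu_p(x) = sum_i w_i * a * N(x; x_i, l^2) over the real line.
pooled_closed_form = a * w.sum()

# Numerical check with a Riemann sum on a wide, fine grid.
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]
mu_p = (a * gauss(grid[:, None], x_obs[None, :], l**2)) @ w
pooled_numeric = mu_p.sum() * dx
```

The two pooled values agree to numerical precision, without the posterior mean ever being discretized in the closed-form path.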

Continuous Convolutional Layers

On a discrete domain, such as the lattice $X = \mathbb{Z}^{d}$, all translation equivariant linear operators $\mathcal{A}$ are convolutions. In general, these convolutions can be written in terms of a linear combination of powers of the generators of the translation group: the shift operators $\tau_i$, i=1, . . . , d, which shift all elements by one unit along the i-th axis of the grid. For a one dimensional grid, one can always write $\mathcal{A} = \sum_k W_k \tau^k$, where the weight matrices $W_k \in \mathbb{R}^{c \times c}$ act only on the channels and the shift operator τ acts on functions on the lattice. In d dimensions, $\mathcal{A} = \sum_{k_1, \ldots, k_d} W_{k_1, \ldots, k_d}\, \tau_1^{k_1} \cdots \tau_d^{k_d}$ for some set of integer coefficients $k_1, \ldots, k_d$. For example, when d=2, $k_1, k_2 \in \{-1, 0, 1\}$ can be taken to fill out a 3×3 neighborhood.
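The one-dimensional lattice form, a sum of channel-mixing matrices times powers of the shift operator, can be sketched as follows. The periodic boundary handling via `np.roll`, the sign convention for the shift, and the function names are assumptions made for this illustration.

```python
import numpy as np

def shift(f, k):
    """(tau^k f)[n] = f[n + k] on a periodic 1-D grid; f has shape (channels, n)."""
    return np.roll(f, -k, axis=1)

def lattice_conv(f, weights):
    """A f = sum_k W_k tau^k f, where `weights` maps an offset k to a (c, c) matrix."""
    return sum(W @ shift(f, k) for k, W in weights.items())

rng = np.random.default_rng(0)
c, n = 2, 8
f = rng.normal(size=(c, n))
weights = {k: rng.normal(size=(c, c)) for k in (-1, 0, 1)}   # 3-tap filter

out = lattice_conv(f, weights)

# Translation equivariance: shifting the input and then applying A
# gives the same result as applying A and then shifting the output.
lhs = lattice_conv(shift(f, 3), weights)
rhs = shift(out, 3)
```

Because every term is a channel-mixing matrix applied to a shifted copy of the input, the operator commutes with shifts, which is the equivariance property stated above.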

On the continuous domain $X = \mathbb{R}^{d}$, convolutions may be similarly parametrized with $\mathcal{A} = \sum_k W_k e^{\mathcal{D}_k}$, where $\mathcal{D}_k$ is given by powers of the partial derivatives $\partial_i$, i=1, . . . , d, that generate infinitesimal translations along the i-th axes. Setting d=1 for simplicity, it can be verified that the operator exponential $\tau^{\alpha} = e^{\alpha\partial_x}$ applied to a function g(x) is a translation:

$e^{\alpha\partial_x} g(x) = g(x) + \alpha g'(x) + \frac{1}{2}\alpha^{2} g''(x) + \ldots = g(x+\alpha)$,

which is the Taylor series expansion of g(x+α) around x. Exponentials of operators can be defined similarly in terms of the formal Taylor series $e^{\mathcal{D}} = \sum_{k=0}^{\infty} \mathcal{D}^{k}/k!$ or more broadly as the solution to the partial differential equation:

$\partial_t g(t,x) = (\mathcal{D}g)(t,x), \quad g(0,x) = g(x)$  (3)

at time t=1:

$e^{\mathcal{D}} g(x) = g(t=1, x)$.
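The claim that the exponential of the derivative operator acts as a translation can be checked exactly on a polynomial, for which the Taylor series terminates. The sketch below uses NumPy's `Polynomial` class; the helper name `exp_shift` and the example polynomial are hypothetical.

```python
import math
from numpy.polynomial import Polynomial

def exp_shift(g, alpha, order):
    """Truncated operator exponential: sum over k <= order of alpha^k / k! * (d^k g / dx^k)."""
    result = Polynomial([0.0])
    term = g
    for k in range(order + 1):
        result = result + term * (alpha**k / math.factorial(k))
        term = term.deriv()   # next derivative of g
    return result

g = Polynomial([1.0, -2.0, 0.5, 3.0])    # g(x) = 1 - 2x + 0.5x^2 + 3x^3
alpha = 0.7
shifted = exp_shift(g, alpha, order=3)   # the series terminates at third order for a cubic
```

Evaluating `shifted(x)` and `g(x + alpha)` at any point gives the same value up to floating point rounding.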

Following the discussion in the discrete case, translation invariance of $\mathcal{D}_k$ imposes that it is expressed in terms of powers of the generators. Collecting the derivatives into the gradient ∇, the general form of $\mathcal{D}_k$ can be written as $\alpha_k + \beta_k^{T}\nabla + \frac{1}{2}\nabla^{T}\Sigma_k\nabla + \ldots$ for any constants $\alpha_k$, vectors $\beta_k$, matrices $\Sigma_k$, etc. For simplicity, the series may be truncated at second order to get:

$\mathcal{D}_k = \beta_k^{T}\nabla + \frac{1}{2}\nabla^{T}\Sigma_k\nabla$,  (4)

where the constants $\alpha_k$, which can be absorbed into the definition of $W_k$, are omitted. For this choice of $\mathcal{D}_k$, the partial differential equation in Equation 3 is nothing but the diffusion equation with drift $\beta_k$ and diffusion $\Sigma_k$. When discussing rotational equivariance, below, a more general form of $\mathcal{D}$ is also considered.

The diffusion layer can also be viewed in another way as the infinitesimal generator of an Ito diffusion (a stochastic process). Given an Ito process with constant drift and diffusion $dX_t = \beta\,dt + \Sigma^{1/2}\,dB_t$, where $B_t$ is a d dimensional Brownian motion, the time evolution operator can be written via the Feynman-Kac formula as $e^{t\mathcal{D}} f(x) = \mathbb{E}[f(X_t)]$, where $X_0 = x$. In other words, the operator layer $e^{t\mathcal{D}}$ is the expectation under a parametrized Neural Stochastic Differential equation that is homogeneous and therefore shift invariant. The flow of this stochastic differential equation depends on the drift and diffusion parameters β and Σ.
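The Feynman-Kac view can be illustrated with a small Monte Carlo experiment in one dimension: simulating the constant-coefficient Ito process and averaging a Gaussian test function should approach the Gaussian obtained by shifting the mean by the drift and adding the diffusion to the variance. The test function, parameter values, sample count, and random seed below are hypothetical choices for this sketch.

```python
import numpy as np

def gauss(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(0)
beta, sigma, t, l = 0.8, 0.5, 1.0, 1.0   # drift, diffusion, time, test-function width
x0 = 0.3                                  # starting point X_0 = x

# For constant drift and diffusion, X_t = x0 + beta * t + sqrt(sigma * t) * Z exactly.
z = rng.standard_normal(200_000)
x_t = x0 + beta * t + np.sqrt(sigma * t) * z

# Monte Carlo estimate of E[f(X_t)] for the test function f(x) = N(x; 0, l^2).
mc_estimate = gauss(x_t, 0.0, l**2).mean()

# Closed form: a Gaussian with mean shifted by -t * beta and variance l^2 + t * sigma.
closed_form = gauss(x0, -t * beta, l**2 + t * sigma)
```

The Monte Carlo estimate matches the closed form up to sampling error, which shrinks as the number of samples grows.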

To recap, a convolution operator may be defined through the general form $\mathcal{A} = \sum_k W_k e^{\mathcal{D}_k}$, where the weight matrices $W_k \in \mathbb{R}^{c \times c}$ mix only channels and $e^{\mathcal{D}_k}$ is the forward evolution by one unit of time of the diffusion equation with drift $\beta_k$ and diffusion $\Sigma_k$, containing learnable parameters $\{(W_k, \beta_k, \Sigma_k)\}_{k=1}^{K}$. The translation equivariance of $\mathcal{A}$ follows directly from the fact that the generators commute, $\forall k, i: [\mathcal{D}_k, \nabla_i] = 0$, and therefore $[\mathcal{A}, \tau_i] = 0$ (the bracket [a, b]=ab−ba is the commutator of the two operators).

Application on Radial Basis Function Gaussian Processes

Although the application of the linear operator $\mathcal{A} = \sum_k W_k e^{\mathcal{D}_k}$ involves the time evolution of a partial differential equation, owing to properties of the radial basis function kernel, the operator may beneficially be applied to an input Gaussian process in closed form. Gaussian processes are closed under linear transformations. For example, given $f \sim \mathcal{GP}(\mu_p, k_p)$, the action of $\mathcal{A}$ need only be computed on the mean and covariance: $\mathcal{A}f \sim \mathcal{GP}(\mathcal{A}\mu_p, \mathcal{A}k_p\mathcal{A}')$, where $\mathcal{A}'$ is the adjoint with respect to the $L_2(X)$ inner product. The application of time evolution $e^{\mathcal{D}_k}$ is a convolution with a Green's function $G_k$, so $\mathcal{A}f = \sum_k W_k e^{\mathcal{D}_k} f = \sum_k W_k G_k * f$. In one aspect, the Green's function for $\mathcal{D}_k = \beta_k^{T}\nabla + \frac{1}{2}\nabla^{T}\Sigma_k\nabla$ is nothing but the multivariate Gaussian density $G_k(x) = \mathcal{N}(x; -\beta_k, \Sigma_k)$, according to:

$\mathcal{A}f = \sum_k W_k e^{\mathcal{D}_k} f = \sum_k W_k G_k * f = \sum_k W_k\, \mathcal{N}(-\beta_k, \Sigma_k) * f$.  (5)

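In order to apply $\mathcal{A}$ to the posterior Gaussian process, the operator need only be applied to the posterior mean and covariance. The posterior mean and covariance in Equation 1 are expressed in terms of $k_{RBF}(x, x') = a\,\mathcal{N}(x; x', l^{2}I)$, and the computation boils down to a convolution of two Gaussians:

$e^{t\mathcal{D}} k_{RBF}(x,x') = \mathcal{N}(x; -t\beta, t\Sigma) * a\,\mathcal{N}(x; x', l^{2}I) = a\,\mathcal{N}(x; x' - t\beta, l^{2}I + t\Sigma)$  (6)

$e^{t\mathcal{D}_1} k_{RBF}(x,x')\, e^{t\mathcal{D}_2'} = a\,\mathcal{N}(x; x' - t(\beta_1 - \beta_2), l^{2}I + t\Sigma_1 + t\Sigma_2)$.  (7)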

The application of the channel mixing matrices W_(k) and summation is also straightforward through matrix multiplication for the mean and covariance. To summarize, because of the closed form action on the radial basis function kernel, the layer can be implemented efficiently and exactly with no discretization or approximations.
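A one-dimensional numerical check of Equation 6 is sketched below: convolving the Gaussian Green's function with the RBF kernel on a fine grid should match the closed-form Gaussian with shifted mean and summed variances. The grid, the parameter values, and the helper name `diffuse_rbf` are hypothetical.

```python
import numpy as np

def gauss(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def diffuse_rbf(x, x_prime, a, l, beta, sigma, t=1.0):
    """Closed form of Equation 6 in one dimension: a * N(x; x' - t*beta, l^2 + t*sigma)."""
    return a * gauss(x, x_prime - t * beta, l**2 + t * sigma)

a, l, beta, sigma, t, x_prime = 1.5, 0.4, 0.6, 0.3, 1.0, 0.2

# Symmetric grid with an odd number of points, so that mode="same" stays aligned with the grid.
grid = np.linspace(-12.0, 12.0, 4801)
dx = grid[1] - grid[0]

green = gauss(grid, -t * beta, t * sigma)     # Green's function N(x; -t*beta, t*sigma)
kernel = a * gauss(grid, x_prime, l**2)       # k_RBF(x, x') as a function of x
numeric = np.convolve(green, kernel, mode="same") * dx
closed = diffuse_rbf(grid, x_prime, a, l, beta, sigma, t)
```

The discretized convolution is used here only as a check; the closed-form path in `diffuse_rbf` never touches a grid.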

Note with respect to the Green's function above that the action of $\mathcal{A}$ encompasses the ordinary convolution operator on the 2d lattice as a special case. For example, given drift $\beta_k \in \{-1, 0, 1\}^{2}$, k=1, . . . , 9, filling out the 9 elements of a 3×3 grid, and as the diffusion $\Sigma_k \to 0$, the Green's function is a Dirac delta, so that:

$\mathcal{A}f(x) = \sum_k W_k\, \delta(x - \beta_k) * f(x) = \sum_{i,j=-1,0,1} W_{ij}\, f(x_1 - i, x_2 - j) = W * f(x)$

General Equivariance

The convolutional layers discussed so far are translation equivariant, but it is possible to extend the continuous linear operator layers to more general symmetries, such as rotations. Feature fields in this more general case are described by tensor fields, where the symmetry group acts not only on the input space X but also on the vector space attached to each point x∈X. A linear layer $\mathcal{A}$ is equivariant if its action commutes with that of the symmetry. It is possible to derive constraints for general linear operators and symmetries, which generalize those known in the context of steerable convolutional neural networks.

Probabilistic Nonlinearities and Rectified Gaussian Processes

It is possible to derive the mean and variance for a univariate rectified Gaussian distribution for use in a neural network. This can then be generalized to the full covariance function (and higher moments) of a rectified Gaussian process.

For example, for an input GP $\mathcal{A}f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$, the standard deviation may be denoted $\sigma(x) = \sqrt{k(x, x)}$, Σ may denote the matrix with components $\Sigma_{ij} = k(x_i, x_j)$ for i, j=1, 2, and μ may denote the mean vector $\mu = [\mu(x_1), \mu(x_2)]$. The notation Φ(z) may be used for the univariate standard normal cumulative distribution function (CDF), and Φ(z; Σ) may be used for the (two-dimensional) multivariate CDF of N(0, Σ) at z. $\Sigma_1$ and $\Sigma_2$ are the column vectors of Σ. The first and second moments of $h = \mathrm{ReLU}[\mathcal{A}f]$ are:

$\mathbb{E}[h(x)] = \mu(x)\,\Phi(\mu(x)/\sigma(x)) + \sigma(x)\,\Phi'(\mu(x)/\sigma(x))$,  (8)

$\mathbb{E}[h(x_1)h(x_2)] = (k(x_1,x_2) + \mu(x_1)\mu(x_2))\,\Phi(\mu; \Sigma) + (\mu(x_1)\Sigma_2^{T} + \mu(x_2)\Sigma_1^{T})\,\nabla\Phi(\mu; \Sigma) + \Sigma_1^{T}\,\nabla\nabla^{T}\Phi(\mu; \Sigma)\,\Sigma_2$  (9)

where ∇∇^(T)Φ denotes the Hessian of Φ with respect to the first argument. The first and higher order derivatives of the Normal CDF are just the probability density function (PDF) and products of the PDF with Hermite polynomials. Note that the mean and covariance interact through the nonlinearity.
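The univariate case can be sketched with the standard library alone. The mean follows Equation 8. The variance uses the standard second moment of a rectified univariate Gaussian, which is stated here as a known result for the single-point case rather than derived from Equation 9. Function names are hypothetical.

```python
import math

def std_normal_pdf(z):
    """Standard normal density, i.e. the derivative of the standard normal CDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def relu_gaussian_moments(mu, sigma):
    """Mean and variance of ReLU(z) for z ~ N(mu, sigma^2)."""
    r = mu / sigma
    mean = mu * std_normal_cdf(r) + sigma * std_normal_pdf(r)   # Equation 8
    # Standard univariate second moment E[ReLU(z)^2].
    second = (mu**2 + sigma**2) * std_normal_cdf(r) + mu * sigma * std_normal_pdf(r)
    return mean, second - mean**2

mean0, var0 = relu_gaussian_moments(0.0, 1.0)       # zero-mean, unit-variance input
mean_big, var_big = relu_gaussian_moments(10.0, 1.0)  # input far above zero
```

For a zero-mean, unit-variance input the mean is 1/√(2π) and the second moment is 1/2. For an input far above zero the rectification has almost no effect, so the moments approach those of the input.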

Channel Mixing and Central Limit Theorem

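After the nonlinearity is applied (e.g., probabilistic ReLU), the process is no longer Gaussian. To overcome this issue, a channel mixing matrix $M \in \mathbb{R}^{c \times c}$ is introduced, and the feature map is defined in the following layer by $f^{(\ell+1)} = M h^{(\ell)}$, where $h^{(\ell)} = \mathrm{ReLU}[\mathcal{A}^{(\ell)} f^{(\ell)}]$. So long as the channels of $h^{(\ell)}$ are only weakly dependent, the central limit theorem (CLT) may be applied to each function according to $f_{\alpha}^{(\ell+1)} = \sum_{\beta} M_{\alpha\beta}\, h_{\beta}^{(\ell)}$ so that in the limit of a large number of channels c, the statistics of the $f_{\alpha}^{(\ell+1)}$'s converge to a Gaussian process with first and second moments given by:

$\mathbb{E}[f^{(\ell+1)}(x)] = M\,\mathbb{E}[h^{(\ell)}(x)], \quad \mathbb{E}[f^{(\ell+1)}(x)\, f^{(\ell+1)}(x')^{T}] = M\,\mathbb{E}[h^{(\ell)}(x)\, h^{(\ell)}(x')^{T}]\,M^{T}$  (10)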

The convergence to a Gaussian process here is reminiscent of the well-known infinite width limit of Bayesian neural networks. However the setting here is fundamentally different. Unlike the Bayesian case where the distribution of M is given by a prior or posterior, in the case of a probabilistic numeric convolutional neural network (PNCNN), M is a deterministic quantity and instead the uncertainty is about the input. Thus, a PNCNN is not a Bayesian method in the sense of representing uncertainty about the parameters of the model, but instead it is Bayesian in representing and propagating the uncertainty in the value of the inputs.
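The moment propagation of Equation 10 is linear algebra on the channel dimension, as the following sketch shows. It uses empirical moments of stand-in rectified samples at a single point (so x = x′) purely to verify the identity; the sample generator, sizes, and function name are hypothetical.

```python
import numpy as np

def mix_moments(M, mean_h, second_h):
    """Equation 10: first and second moments of f = M h from those of h."""
    mean_f = M @ mean_h
    second_f = M @ second_h @ M.T
    return mean_f, second_f

rng = np.random.default_rng(0)
c, n_samples = 5, 1000
M = rng.normal(size=(c, c)) / np.sqrt(c)                 # deterministic mixing matrix
h = np.maximum(rng.normal(size=(n_samples, c)), 0.0)     # stand-in rectified activations

mean_h = h.mean(axis=0)
second_h = h.T @ h / n_samples                           # empirical E[h h^T]
mean_f, second_f = mix_moments(M, mean_h, second_h)

# Direct computation on the mixed samples for comparison.
f = h @ M.T
```

The propagated moments coincide with the empirical moments of the mixed samples, since mixing is a deterministic linear map and only the input carries uncertainty.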

Measurement and Projection to RBF Gaussian Process

As a last step, the mean and covariance functions of the approximate GP $f^{(\ell+1)}$ are simplified. While it is possible to compute the values of these functions, unlike in the RBF kernel case, it is not possible to apply the convolution operator $\mathcal{A}^{(\ell+1)}$ in closed form. In order to circumvent this challenge, the (approximately) Gaussian process $f^{(\ell+1)}$ is modeled with an RBF Gaussian process as follows. First, the mean $y_i = \mathbb{E}[f^{(\ell+1)}(x_i)]$ and variance $\sigma_i^{2} = \mathrm{Var}[f^{(\ell+1)}(x_i)]$ of the approximate Gaussian process $f^{(\ell+1)}$ are evaluated at a collection of points $\{x_i\}_{i=1}^{N}$ using Equations 8, 9 and 10. These values $y_i$ are treated as measurements of the underlying signal with a heteroscedastic noise $\sigma_i^{2}$ that varies from point to point. Second, the RBF-based posterior GP of this signal, $\hat{f} \mid \{(x_i, y_i, \sigma_i)\}_{i=1}^{N} \sim \mathcal{GP}(\mu_p, k_p)$, with posterior mean and covariance given by Equation 1, is computed for the heteroscedastic noise model. The uncertainty in the input $f^{(\ell+1)}$ is propagated through to the RBF posterior $\hat{f} \mid \{(x_i, y_i, \sigma_i)\}_{i=1}^{N}$ via the measurement noise $\sigma_i$. Notably, the mean and covariance functions of this Gaussian process are written in terms of the RBF kernel, and therefore it is possible to continue applying convolutions in closed form in future layers.

As described further below, the RBF kernel in each layer is trained to maximize the marginal likelihood of the data that it sees, and thereby minimize the discrepancy with the underlying generating distribution $f^{(\ell+1)}$. This measurement/projection approach is effective in many scenarios.

Training Procedure

An example neural network model, such as depicted in FIG. 1, may have two sets of parameters: the channel mixing and diffusion parameters, $\{(M^{(\ell)}, W_k^{(\ell)}, \beta_k^{(\ell)}, \Sigma_k^{(\ell)})\}_{\ell=1}^{L}$, as well as kernel hyperparameters of the Gaussian processes, $\{(l^{(\ell)}, a^{(\ell)})\}_{\ell=1}^{L}$. In some aspects, all parameters are trained jointly on the loss $L_{task} + \lambda L_{GP}$, where $L_{task}$ is the cross entropy with logits given by the mean $\mu_p$ of the pooled features $\mathcal{P}(f^{(L)}) \sim \mathcal{N}(\mu_p, \Sigma_p)$ and $L_{GP}$ is the sum of the negative marginal log likelihoods of the GP feature maps:

$L_{GP}(f) = \frac{1}{2}\sum_{\ell,\alpha}\left[ f_{\alpha}^{T}[K_{XX} + S_{\alpha}]^{-1} f_{\alpha} + \log\det[K_{XX} + S_{\alpha}] + N \log 2\pi \right]$  (11)

where for each layer $\ell$, $f_{\alpha} = [f_{\alpha}(x_1), \ldots, f_{\alpha}(x_N)] \in \mathbb{R}^{N}$ are the observed values for channel α at locations $X = [x_1, \ldots, x_N]$, $K_{XX}$ is the covariance of the RBF kernel, $S_{\alpha} = \mathrm{diag}(\sigma_{\alpha}^{2})$ is the measurement noise for each channel α and spatial location, and log det[·] is a log determinant function. Notably, the GP marginal likelihood is independent of the class labels.
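A single (layer, channel) term of Equation 11 might be computed through a Cholesky factorization, which yields both the quadratic form and the log determinant from one decomposition. The sketch below is one possible arrangement; the function name, sample locations, and hyperparameter values are hypothetical.

```python
import numpy as np

def gp_nll(f_obs, K, noise_var):
    """One (layer, channel) term of Equation 11 for observations f_obs of shape (N,)."""
    n = f_obs.shape[0]
    chol = np.linalg.cholesky(K + np.diag(noise_var))    # K_XX + S = chol @ chol.T
    alpha = np.linalg.solve(chol, f_obs)                 # alpha @ alpha = f^T [K_XX + S]^{-1} f
    log_det = 2.0 * np.log(np.diag(chol)).sum()          # log det[K_XX + S]
    return 0.5 * (alpha @ alpha + log_det + n * np.log(2.0 * np.pi))

# Hypothetical one-dimensional locations, RBF hyperparameters, and measurements.
x = np.array([0.0, 0.4, 1.3, 2.0])
l, a = 0.7, 1.0
K = a * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l**2) / np.sqrt(2.0 * np.pi * l**2)
noise_var = np.full(4, 0.05)
f_obs = np.array([0.2, 0.5, -0.1, 0.3])

nll = gp_nll(f_obs, K, noise_var)
```

Summing this quantity over layers and channels and adding it, weighted by λ, to the cross-entropy term gives the joint loss described above. In a trainable implementation the same computation would be expressed in an automatic differentiation framework so that gradients reach the kernel hyperparameters.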

Example Probabilistic Numeric Convolutional Neural Network Architecture

FIG. 1 depicts an example 100 of a probabilistic numeric convolutional neural network architecture. In the depicted example, input data is superpixel data, which is an example of sparsely sampled continuous data, though in other aspects, other types of input data, including other types of sparsely (or irregularly) sampled input data, may be used. Generally, the mean and elementwise uncertainty of the Gaussian process feature maps are shown as they are transformed through the network by the convolution layers. Observation points are shown as dots in σ(x).

As depicted, in a first convolution layer 104, a Gaussian process 106 is determined based on the input data x, which includes a determined mean function μ₁(x) 108 and a determined standard deviation σ₁(x) 110. The standard deviation σ₁(x) may then be used to determine a covariance. In one aspect, as described above, a covariance kernel k(x, x′) can be determined as above in Equation 1.

This Gaussian process 106 in layer 104 is used to interpolate the data input to linear operator $\mathcal{A}^{(1)}$ 112, which in turn generates pre-activation data, and thus substitutes for a conventional convolution layer. In some aspects, linear operator $\mathcal{A}^{(1)}$ is implemented as a diffusion process, which replaces learnable parameters in the convolutional neural network, as described above.

Then, a pointwise nonlinear operation (probabilistic ReLU in this example) is applied to the pre-activation data to generate activations, and then channel mixing is performed, as described above. Finally, the current Gaussian process is evaluated (measured) at a given set of points, and the uncertainty of the Gaussian process is treated as a heteroscedastic noise model. This is the output of the layer, which can then be passed on to the next layer, yielding a new Gaussian process for the second layer with transformed mean function μ₂ (x) and standard deviation σ₂ (x).

This process is repeated through a plurality of layers (four in this example) and ends with an integral pooling operation (as discussed above) at 114. The output 116 of the model and process 100 in this example is a random variable with mean μ and uncertainty Σ.

As depicted, during training, the cross-entropy loss of the model output is minimized along with the sum of the marginal log likelihoods (MLL) of each layer, for example, according to Equation 11 above. However, in other aspects, the cross-entropy loss is first minimized, followed by the sum of the marginal log likelihoods.

Example Method for Performing Operations with a Probabilistic Numeric Convolutional Neural Network Model

FIG. 2 depicts an example method 200 for performing operations with a probabilistic numeric neural network.

Method 200 begins at step 202 with receiving input data (e.g., x). In some aspects, the input data is in the form of a vector-valued function (e.g., ƒ(x)).

Method 200 then proceeds to step 204 with calculating a mean of the input data (e.g., μ(x)).

Method 200 then proceeds to step 206 with calculating a covariance of the input data (e.g., k(x, x′)).

Method 200 then proceeds to step 208 with determining a Gaussian process based on the mean and the covariance of the input data (e.g., GP (μ(x), k(x, x′))), where k(x, x)=σ² (x).

Method 200 then proceeds to step 210 with applying a linear operator ($\mathcal{A}$) to the Gaussian process to generate pre-activation data. In one aspect, this may be performed according to $\mathcal{A}[f] \sim \mathcal{GP}(\mathcal{A}\mu(x), \mathcal{A}k(x, x')\mathcal{A}^{\dagger})$.

Method 200 then proceeds to step 212 with applying a nonlinear operation to the pre-activation data to form activation data (e.g., $\sigma(\mathcal{A}[f])$, where σ is a nonlinear operator such as ReLU). In some embodiments, channel mixing may further be performed on the activation data, such as according to Equation 10 above. In some aspects, step 212 may be performed iteratively across two or more layers of a model.

Method 200 then proceeds to step 214 with applying a pooling operation to the activation data to generate an inference. In some aspects, the pooling operation is an integral pooling operation, such as described above. In some aspects, the inference is in the form of a random variable with mean μ and uncertainty Σ (e.g., N(μ, Σ)), such as 116 in FIG. 1.

During a training phase, method 200 may then optionally proceed to step 216 with calculating a loss based on the inference.

Further during a training phase, method 200 may then optionally proceed to step 218 with training parameters of the linear operator (e.g., $\mathcal{A}$) based on the loss.

In some aspects of method 200, applying a linear operator to the Gaussian process comprises applying a diffusion equation (e.g., e^(tD)ƒ(x)).

In some aspects of method 200, the loss comprises a cross-entropy component.

In some aspects of method 200, the loss further comprises a marginal log likelihood component associated with the Gaussian process; and the method further comprises: training parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.

In some aspects of method 200, training parameters of the linear operator comprises performing gradient descent on the parameters of the linear operator.

In some aspects of method 200, training parameters of the Gaussian process comprises performing gradient descent on the parameters of the Gaussian process.

In some aspects of method 200, the nonlinear operation comprises a probabilistic ReLU operation.

In some aspects of method 200, the input data comprises irregularly sampled data (e.g., {(x_(i), ƒ(x_(i)))}_(i=1) ^(N)).

Example Processing System

FIG. 3 depicts an example processing system 300 configured to perform the various methods described herein, including, for example, with respect to FIGS. 1 and 2.

Processing system 300 includes a central processing unit (CPU) 302, which in some examples may be a multi-core CPU. Instructions executed at the CPU 302 may be loaded, for example, from a program memory associated with the CPU 302 or may be loaded from memory 314.

Processing system 300 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 304, a digital signal processor (DSP) 306, and a neural processing unit (NPU) 308.

An NPU, such as 308, is generally a specialized circuit configured for implementing all the necessary control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), probabilistic numeric convolutional neural networks (PNCNNs), and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.

NPUs, such as 308, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples they may be part of a dedicated neural-network accelerator.

NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.

NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.

NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process it through an already trained model to generate a model output (e.g., an inference).

In one implementation, NPU 308 is a part of one or more of CPU 302, GPU 304, and/or DSP 306.

In some examples, connectivity component 312 may include various subcomponents, for example, for wide area network (WAN), local area network (LAN), Wi-Fi connectivity, Bluetooth connectivity, and other data transmission standards.

Processing system 300 may also include one or more input and/or output devices 310, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.

In some examples, one or more of the processors of processing system 300 may be based on an ARM or RISC-V instruction set.

Processing system 300 also includes memory 314, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, memory 314 includes computer-executable components, which may be executed by one or more of the aforementioned processors of processing system 300.

In this example, memory 314 includes Gaussian process component 314A, linear operator component 314B, nonlinear operation component 314C, pooling component 314D, measuring component 314E, loss calculation component 314F, training component 314G, inferencing component 314H, model parameters 314I, and models 314J. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.

Generally, processing system 300 and/or components thereof may be configured to perform the methods described herein, including methods described with respect to FIGS. 1 and 2.

Notably, in other aspects, processing system 300 may include additional, alternative, or fewer elements. Further, various aspects of methods described above may be performed on one or more processing systems.

Example Clauses

Implementation examples are described in the following numbered clauses:

Clause 1: A method of performing operations with a probabilistic numeric neural network, comprising: defining a Gaussian Process based on a mean and a covariance of input data; applying a linear operator to the Gaussian Process to generate pre-activation data; applying a nonlinear operation to the pre-activation data to form activation data; and applying a pooling operation to the activation data to generate an inference.

Clause 2: The method of Clause 1, wherein the inference comprises a random variable.

Clause 3: The method of any one of Clauses 1-2, wherein applying a linear operator to the Gaussian process comprises applying a diffusion equation to the Gaussian process.

Clause 4: The method of any one of Clauses 1-3, further comprising: calculating a loss based on the inference; and training parameters of the linear operator based on the loss.

Clause 5: The method of Clause 4, wherein the loss comprises a cross entropy component.

Clause 6: The method of Clause 5, wherein: the loss further comprises a marginal log likelihood component associated with the Gaussian process, and the method further comprises: training parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.

Clause 7: The method of Clause 5, wherein training parameters of the linear operator comprises performing gradient descent on the training parameters of the linear operator.

Clause 8: The method of any one of Clauses 6-7, wherein training parameters of the Gaussian process comprises performing gradient descent on the training parameters of the Gaussian process.

Clause 9: The method of any one of Clauses 1-8, wherein the nonlinear operation comprises a probabilistic ReLU operation.

Clause 10: The method of any one of Clauses 1-9, wherein the input data comprises irregularly sampled data.

Clause 11: The method of any one of Clauses 1-10, wherein the linear operator comprises

=Σ_(k) W_(k)

and applying the linear operator to pre-activation data is performed according to Equation 5.

Clause 12: A processing system, comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any one of Clauses 1-11.

Clause 13: A processing system, comprising means for performing a method in accordance with any one of Clauses 1-11.

Clause 14: A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any one of Clauses 1-11.

Clause 15: A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any one of Clauses 1-11.

Additional Considerations

The preceding description is provided to enable any person skilled in the art to practice the various embodiments described herein. The examples discussed herein are not limiting of the scope, applicability, or embodiments set forth in the claims. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.

As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.

As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).

As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, “determining” may include resolving, selecting, choosing, establishing and the like.

The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Further, the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to a circuit, an application specific integrated circuit (ASIC), or processor. Generally, where there are operations illustrated in figures, those operations may have corresponding counterpart means-plus-function components with similar numbering.

The following claims are not intended to be limited to the embodiments shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.” All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. 

What is claimed is:
 1. A method of performing operations with a probabilistic numeric convolutional neural network, comprising: defining a Gaussian Process based on a mean and a covariance of input data; applying a linear operator to the Gaussian Process to generate pre-activation data; applying a nonlinear operation to the pre-activation data to form activation data; and applying a pooling operation to the activation data to generate an inference.
 2. The method of claim 1, wherein the inference comprises a random variable.
 3. The method of claim 1, wherein applying a linear operator to the Gaussian process comprises applying a diffusion equation to the Gaussian process.
 4. The method of claim 1, further comprising: calculating a loss based on the inference; and training parameters of the linear operator based on the loss.
 5. The method of claim 4, wherein the loss comprises a cross entropy component.
 6. The method of claim 5, wherein: the loss further comprises a marginal log likelihood component associated with the Gaussian process, and the method further comprises training parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.
 7. The method of claim 6, wherein training parameters of the linear operator comprises performing gradient descent on the parameters of the linear operator.
 8. The method of claim 7, wherein training parameters of the Gaussian process comprises performing gradient descent on the parameters of the Gaussian process.
 9. The method of claim 1, wherein the nonlinear operation comprises a probabilistic ReLU operation.
 10. The method of claim 1, wherein the input data comprises irregularly sampled data.
 11. A processing system, comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to: define a Gaussian Process based on a mean and a covariance of input data; apply a linear operator to the Gaussian Process to generate pre-activation data; apply a nonlinear operation to the pre-activation data to form activation data; and apply a pooling operation to the activation data to generate an inference.
 12. The processing system of claim 11, wherein the inference comprises a random variable.
 13. The processing system of claim 11, wherein in order to apply a linear operator to the Gaussian process, the one or more processors are further configured to cause the processing system to apply a diffusion equation to the Gaussian process.
 14. The processing system of claim 11, wherein the one or more processors are further configured to cause the processing system to: calculate a loss based on the inference; and train parameters of the linear operator based on the loss.
 15. The processing system of claim 14, wherein the loss comprises a cross entropy component.
 16. The processing system of claim 15, wherein: the loss further comprises a marginal log likelihood component associated with the Gaussian process, and the one or more processors are further configured to cause the processing system to train parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.
 17. The processing system of claim 16, wherein in order to train parameters of the linear operator, the one or more processors are further configured to cause the processing system to perform gradient descent on the parameters of the linear operator.
 18. The processing system of claim 17, wherein in order to train parameters of the Gaussian process, the one or more processors are further configured to cause the processing system to perform gradient descent on the parameters of the Gaussian process.
 19. The processing system of claim 11, wherein the nonlinear operation comprises a probabilistic ReLU operation.
 20. The processing system of claim 11, wherein the input data comprises irregularly sampled data.
 21. A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by a processor of a processing system, cause the processing system to perform a method of training a probabilistic numeric neural network, the method comprising: defining a Gaussian Process based on a mean and a covariance of input data; applying a linear operator to the Gaussian Process to generate pre-activation data; applying a nonlinear operation to the pre-activation data to form activation data; and applying a pooling operation to the activation data to generate an inference.
 22. The non-transitory computer-readable medium of claim 21, wherein the inference comprises a random variable.
 23. The non-transitory computer-readable medium of claim 21, wherein applying a linear operator to the Gaussian process comprises applying a diffusion equation to the Gaussian process.
 24. The non-transitory computer-readable medium of claim 21, wherein the method further comprises: calculating a loss based on the inference; and training parameters of the linear operator based on the loss.
 25. The non-transitory computer-readable medium of claim 24, wherein the loss comprises a cross entropy component.
 26. The non-transitory computer-readable medium of claim 25, wherein: the loss further comprises a marginal log likelihood component associated with the Gaussian process, and the method further comprises training parameters of the Gaussian process based on the marginal log likelihood component associated with the Gaussian process.
 27. The non-transitory computer-readable medium of claim 26, wherein: training parameters of the linear operator comprises performing gradient descent on the parameters of the linear operator, and training parameters of the Gaussian process comprises performing gradient descent on the parameters of the Gaussian process. 